Join two sheets, groupby and sum on the joined data



In [53]:

    
import pandas as pd



In [75]:

    
# load both sheets as new dataframes
shows_df = pd.read_csv("show_category.csv")
views_df = pd.read_excel("views.xls")

Change the index to be the showname



In [76]:

    
shows_df.head()









    Out[76]:






  
    
      
      showname
      category
    
  
  
    
      0
      firefly
      scifi
    
    
      1
      hollyoaks
      soap
    
    
      2
      battlestar
      scifi



In [77]:

    
shows_df = shows_df.set_index('showname')
shows_df.head()









    Out[77]:






  
    
      
      category
    
    
      showname
      
    
  
  
    
      firefly
      scifi
    
    
      hollyoaks
      soap
    
    
      battlestar
      scifi

Do the same for views

DANGER note that battle-star has a hyphen!



In [78]:

    
views_df.head()









    Out[78]:






  
    
      
      viewer_id
      show_watched
      views
    
  
  
    
      0
      23
      firefly
      2
    
    
      1
      29
      hollyoaks
      4
    
    
      2
      22
      firefly
      100
    
    
      3
      22
      battle-star
      10
    
    
      4
      22
      FIREFLY
      200



In [79]:

    
views_df = views_df.set_index('viewer_id')
views_df.head() # note that we can have repeating viewer_id values (they're non-unique)









    Out[79]:






  
    
      
      show_watched
      views
    
    
      viewer_id
      
      
    
  
  
    
      23
      firefly
      2
    
    
      29
      hollyoaks
      4
    
    
      22
      firefly
      100
    
    
      22
      battle-star
      10
    
    
      22
      FIREFLY
      200



In [80]:

    
# we can select out the column to work on, then use the built-in str (string) functions
# to replace hyphens (we do this and just print the result to screen)
views_df['show_watched'].str.replace("-", "")









    Out[80]:





viewer_id
23       firefly
29     hollyoaks
22       firefly
22    battlestar
22       FIREFLY
22      FIREFLY 
Name: show_watched, dtype: object



In [81]:

    
# now we do the fix in-place
views_df['show_watched'] = views_df['show_watched'].str.replace("-", "")
# NOTE if you comment out the line above, you'll get a NaN in the final table
# as `battle-star` won't be joined
views_df









    Out[81]:






  
    
      
      show_watched
      views
    
    
      viewer_id
      
      
    
  
  
    
      23
      firefly
      2
    
    
      29
      hollyoaks
      4
    
    
      22
      firefly
      100
    
    
      22
      battlestar
      10
    
    
      22
      FIREFLY
      200
    
    
      22
      FIREFLY
      300



In [82]:

    
print("Index info:", views_df.index)
views_df.ix[22] # select the items with index 22 (note this is an integer, not string value)









    



Index info: Int64Index([23, 29, 22, 22, 22, 22], dtype='int64', name='viewer_id')






    Out[82]:






  
    
      
      show_watched
      views
    
    
      viewer_id
      
      
    
  
  
    
      22
      firefly
      100
    
    
      22
      battlestar
      10
    
    
      22
      FIREFLY
      200
    
    
      22
      FIREFLY
      300

Join on shows watched against category

PROBLEM - we have NaN values for the last two Firefly entries. Can you do something, earlier on, to fix this, from here inside the Notebook?



In [83]:

    
shows_views_df = views_df.join(shows_df, on='show_watched')
shows_views_df









    Out[83]:






  
    
      
      show_watched
      views
      category
    
    
      viewer_id
      
      
      
    
  
  
    
      23
      firefly
      2
      scifi
    
    
      29
      hollyoaks
      4
      soap
    
    
      22
      firefly
      100
      scifi
    
    
      22
      battlestar
      10
      scifi
    
    
      22
      FIREFLY
      200
      NaN
    
    
      22
      FIREFLY
      300
      NaN



In [84]:

    
# take out two relevant columns, group by category, sum the views
shows_views_df[['views', 'category']].groupby('category').sum()

	viewer_id	show_watched	views
0	23	firefly	2
1	29	hollyoaks	4
2	22	firefly	100
3	22	battle-star	10
4	22	FIREFLY	200

	show_watched	views
viewer_id
23	firefly	2
29	hollyoaks	4
22	firefly	100
22	battle-star	10
22	FIREFLY	200

	show_watched	views	category
viewer_id
23	firefly	2	scifi
29	hollyoaks	4	soap
22	firefly	100	scifi
22	battlestar	10	scifi
22	FIREFLY	200	NaN
22	FIREFLY	300	NaN